
[release-4.11] Bug 2118586: on-prem: improvements on resolv-prepender #3287

Conversation

openshift-cherrypick-robot

This is an automated cherry-pick of #3271

/assign mandre

Currently, a NetworkManager dispatcher script does not have the correct
SELinux permission to D-Bus chat with hostnamed. Work around the issue
using systemd-run.

See: https://bugzilla.redhat.com/show_bug.cgi?id=2111632

Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
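For context, a minimal sketch of the systemd-run workaround described above; the hostnamectl call and the NEW_HOSTNAME variable are illustrative, not the literal resolv-prepender code:

```bash
# NetworkManager dispatcher scripts run under an SELinux domain that is not
# allowed to D-Bus chat with systemd-hostnamed, so a direct call such as
#   hostnamectl set-hostname "${NEW_HOSTNAME}"
# would be denied. Running it through systemd-run executes the command in a
# short-lived transient unit with a permitted context instead.
systemd-run --wait --collect -- hostnamectl set-hostname "${NEW_HOSTNAME}"
```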
If resolv-prepender takes more than the NetworkManager timeout (currently
90s), devices might fail to come up before we have had a chance to
process all possible events for a device. The script needs to account for
the different types of events, including both IPv4 and IPv6 events in the
dual-stack case, and overall take less than the NetworkManager timeout.

Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
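As an illustration of the timeout budgeting described above (the numbers and the prepender_step helper are assumptions, not values from the actual script):

```bash
# Keep the worst-case time spent handling all dispatcher events for a device
# below NetworkManager's 90s dispatcher timeout. With dual stack a device can
# generate several events (up, dhcp4-change, dhcp6-change, ...), so the
# per-event budget is the total budget divided by the assumed event count.
NM_DISPATCHER_TIMEOUT=90
MAX_EVENTS_PER_DEVICE=4                                     # assumed worst case
PER_EVENT_TIMEOUT=$(( (NM_DISPATCHER_TIMEOUT - 10) / MAX_EVENTS_PER_DEVICE ))

# Bound each individual step by the per-event budget:
timeout "${PER_EVENT_TIMEOUT}s" prepender_step              # prepender_step is a stand-in
```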
node-ip can fail if a device is not yet ready to be bound to. Retry, but
keep the total time, accounting for all the events we need to attend to,
below the NetworkManager timeout (90s).

Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
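A hypothetical retry loop matching that description; run_node_ip and the budget value are placeholders rather than the real invocation:

```bash
# Retry node-ip while the device may not yet be ready to bind to, but bound the
# total retry time so that, together with the other events we need to handle,
# we stay under NetworkManager's 90s timeout.
RETRY_BUDGET=15                           # assumed per-event budget in seconds
deadline=$(( SECONDS + RETRY_BUDGET ))
until run_node_ip; do                     # stand-in for the real node-ip call
    if (( SECONDS >= deadline )); then
        echo "node-ip still failing after ${RETRY_BUDGET}s, giving up" >&2
        break
    fi
    sleep 1
done
```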
Make resolv-prepender wait for nameservers in
/run/NetworkManager/resolv.conf in all cases, to avoid copying it to
/etc/resolv.conf without any nameservers.

Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
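Roughly, that wait amounts to a poll loop like the following sketch (the interval and iteration count are assumptions):

```bash
# Before copying /run/NetworkManager/resolv.conf over /etc/resolv.conf, wait
# until it contains at least one nameserver entry so we never install an empty
# resolver configuration on the host.
NM_RESOLV_CONF=/run/NetworkManager/resolv.conf
for _ in $(seq 1 30); do
    grep -q '^nameserver ' "${NM_RESOLV_CONF}" 2>/dev/null && break
    sleep 1
done
```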
Without a properly configured resolv.conf, the openshift-dns CoreDNS pods
will fail to run. These pods have the default DNS policy and therefore use
the host's resolv.conf, which is whatever kubelet picked up when it started.

Signed-off-by: Jaime Caamaño Ruiz <[email protected]>
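To see why the host resolv.conf matters here, one can check the DNS policy of the openshift-dns pods; this is an illustrative command and the label selector is an assumption:

```bash
# Pods with dnsPolicy "Default" inherit the node's resolver configuration,
# i.e. whatever kubelet read from /etc/resolv.conf when it started.
oc -n openshift-dns get pods -l dns.operator.openshift.io/daemonset-dns=default \
    -o jsonpath='{.items[*].spec.dnsPolicy}'
```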
@openshift-ci
Contributor

openshift-ci bot commented Aug 16, 2022

@openshift-cherrypick-robot: Bugzilla bug 2105003 has been cloned as Bugzilla bug 2118586. Retitling PR to link against new bug.
/retitle [release-4.11] Bug 2118586: on-prem: improvements on resolv-prepender

In response to this:

[release-4.11] Bug 2105003: on-prem: improvements on resolv-prepender

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot changed the title [release-4.11] Bug 2105003: on-prem: improvements on resolv-prepender [release-4.11] Bug 2118586: on-prem: improvements on resolv-prepender Aug 16, 2022
@openshift-ci openshift-ci bot added bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Aug 16, 2022
@openshift-ci
Contributor

openshift-ci bot commented Aug 16, 2022

@openshift-cherrypick-robot: This pull request references Bugzilla bug 2118586, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

6 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.11.z) matches configured target release for branch (4.11.z)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST)
  • dependent bug Bugzilla bug 2105003 is in the state VERIFIED, which is one of the valid states (VERIFIED, RELEASE_PENDING, CLOSED (ERRATA), CLOSED (CURRENTRELEASE))
  • dependent Bugzilla bug 2105003 targets the "4.12.0" release, which is one of the valid target releases: 4.12.0
  • bug has dependents

Requesting review from QA contact:
/cc @sunzhaohua2

In response to this:

[release-4.11] Bug 2118586: on-prem: improvements on resolv-prepender

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jcaamano
Contributor

/test e2e-metal-ipi-ovn-dualstack
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere
/test e2e-ovirt

@knobunc
Contributor

knobunc commented Aug 17, 2022

/approve

@jcaamano
Contributor

/assign @sinnykumari

@sinnykumari
Contributor

Should e2e-vsphere and e2e-openstack be green as well?

@cybertron
Member

/test e2e-openstack
/test e2e-vsphere
/lgtm
/label backport-risk-assessed

Ideally, yes.

@openshift-ci openshift-ci bot added the backport-risk-assessed Indicates a PR to a release branch has been evaluated and considered safe to accept. label Aug 17, 2022
@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 17, 2022
@sinnykumari
Contributor

e2e-vsphere and e2e-openstack are still failing. I am adding my approval and will leave this to the on-prem team to decide when this is ready to get merged. Feel free to remove the hold when this looks fine.
/hold
/approve
/test e2e-openstack
/test e2e-vsphere

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 18, 2022
@openshift-ci
Contributor

openshift-ci bot commented Aug 18, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cybertron, knobunc, openshift-cherrypick-robot, sinnykumari

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 18, 2022
@jcaamano
Contributor

I see no history of those jobs passing in this repo/branch.
I don't see any related issues for the e2e-vsphere job.
On the last e2e-openstack job the DNS operator is complaining, but that's because something else failed while it was still progressing. Past jobs don't seem to have any related issues. I checked some of the journals and all seems normal.

Triggering once more
/test e2e-openstack
/test e2e-vsphere

@jcaamano
Contributor

Great, the openstack job now failed with the all-too-familiar "Timed out waiting for node count (5) to equal or exceed machine count (6)".
There is still a chance this is unrelated.
I created a dummy PR to crosscheck: #3297

@jcaamano
Contributor

/test e2e-openstack
/test e2e-vsphere

@mandre
Member

mandre commented Aug 19, 2022

Great, the openstack job now failed with the all-too-familiar "Timed out waiting for node count (5) to equal or exceed machine count (6)". There is still a chance this is unrelated. I created a dummy PR to crosscheck: #3297

According to the machine log, bzlr9p1b-174af-v968k-worker-0-4jvtf failed validation with:

"errorMessage": "Machine validation failed: \nError getting a new instance service from the machine: Failed to get cloud from secret: Failed to get secrets from kubernetes api: Get \"https://172.30.0.1:443/api/v1/namespaces/openshift-machine-api/secrets/openstack-cloud-credentials\": dial tcp 172.30.0.1:443: i/o timeout - error from a previous attempt: read tcp 10.130.0.12:49270-\u003e172.30.0.1:443: read: connection reset by peer",
"errorReason": "InvalidConfiguration",

It looks like the API server was not reachable at the time the node made the request; DNS seemed to work, however, so this is apparently a different issue (most likely infra-related).

@jcaamano
Contributor

The last openstack job had problems on master-0: weird issues with openshift-sdn, and must-gather was not able to collect node logs. While this does not look very good, I still can't tie anything specific to these changes.

At least the vsphere job passed.

/test e2e-openstack

@mandre
Member

mandre commented Aug 19, 2022

The openstack job failed again with the same "Timed out waiting for node count (5) to equal or exceed machine count (6)" error, except this time machine sz3jc4kz-174af-qck5g-worker-0-9h59q is stuck in the Provisioned status.

Logs from the instance show the following error:

[  385.651927] overlayfs: failed to resolve '/var/lib/containers/storage/overlay/l/RNRNIUQQVY6AO6BZOTQXMWC4KJ': -2

I'm not quite sure what causes it.

@mandre
Member

mandre commented Aug 19, 2022

@mandre again the node count issue, no error but there are only 2 workers and both are worker-0, how's that?

That's because they were created by the worker-0 machineset. It's the convention we use in OpenStack, where the installer suffixes the machinesets with an index representing an AZ number. Nothing to worry about. If we had another AZ, we would have a worker-1 machineset and so on.

@jcaamano
Contributor

I am trying to run this on my own with cluster bot. In the meantime...

/test e2e-openstack

@jcaamano
Contributor

Cluster bot was able to launch the cluster with no issues.

/test e2e-openstack

@jcaamano
Contributor

On the last run, only some tests fail now:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/3287/pull-ci-openshift-machine-config-operator-release-4.11-e2e-openstack/1561664665711284224

Of those, many are tests I have also seen failing in the test PR #3297 job:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/3297/pull-ci-openshift-machine-config-operator-release-4.11-e2e-openstack/1560604743808585728

including
[sig-arch] events should not repeat pathologically
[bz-etcd][invariant] alert/etcdGRPCRequestsSlow should not be at or above info
[bz-etcd][invariant] alert/etcdMemberCommunicationSlow should not be at or above info

Then there are some tests that fail in one and not in the other, and vice versa.

It looks to me like this job is very sensitive to this infra, but I at least managed to get it to pass once on my test PR.

/test e2e-openstack

@mandre
Member

mandre commented Aug 22, 2022

OK, this last run looks better. We're getting a lot of etcd-related test failures on openstack recently due to the underlying infra, so the job failures aren't too alarming. I'd be willing to merge now if needed.

@cybertron
Member

I think we're all in agreement that the openstack failure is unlikely to be caused by this patch, so we can go ahead and merge without that job in this instance.

@sinnykumari
Contributor

Removing the hold as the e2e-openstack test failure is unrelated.
/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 22, 2022
@rbbratta
Contributor

/label cherry-pick-approved

@openshift-ci openshift-ci bot added the cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. label Aug 22, 2022
@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 2 against base HEAD 54a105e and 8 for PR HEAD 5b18d21 in total

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 1 against base HEAD 54a105e and 7 for PR HEAD 5b18d21 in total

@openshift-ci
Contributor

openshift-ci bot commented Aug 23, 2022

@openshift-cherrypick-robot: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aws-disruptive | 5b18d21 | link | false | /test e2e-aws-disruptive |
| ci/prow/e2e-aws-upgrade-single-node | 5b18d21 | link | false | /test e2e-aws-upgrade-single-node |
| ci/prow/e2e-aws-single-node | 5b18d21 | link | false | /test e2e-aws-single-node |
| ci/prow/e2e-openstack | 5b18d21 | link | false | /test e2e-openstack |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot merged commit d33d8dc into openshift:release-4.11 Aug 23, 2022
@openshift-ci
Contributor

openshift-ci bot commented Aug 23, 2022

@openshift-cherrypick-robot: All pull requests linked via external trackers have merged:

Bugzilla bug 2118586 has been moved to the MODIFIED state.

In response to this:

[release-4.11] Bug 2118586: on-prem: improvements on resolv-prepender

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
